Bayesian Analysis of the risk of collision of asteroids with the Earth


There is an immense number of objects in outer space, and some of them are closer than we think. A distance of 70,000 km may seem too large to pose any threat, but on an astronomical scale it is very small, and an object passing that close can disrupt many natural phenomena.

These objects/asteroids can thus prove to be harmful. It is therefore wise to know what surrounds us, and which of those objects could harm us.

The objective of the analysis is to discover whether an asteroid is potentially dangerous for the Earth or not.
To do that, we have several Machine Learning algorithms at our disposal for binary classification.

The dataset used for this analysis is taken from Kaggle and contains more than 90,000 observations regarding about 27,000 objects in space.

Three statistical models for classification are applied:
-Probit Regression
-Logistic Regression (with Logit link function)
-Complementary log-log regression (cloglog)

The parameters of the models are estimated with Markov Chain Monte Carlo (MCMC) methods, and a fully Bayesian in-depth analysis is applied to them.

The analysis is divided in 4 parts:

  1. Exploratory Data Analysis and Feature Engineering
  2. Application of three different statistical models for binary classification: Probit Regression, Logistic Regression (Logit) and Complementary log-log Regression, each compared with its frequentist counterpart
  3. Choice of the best model among the six analyzed
  4. Conclusions

Now, let’s move on.

Exploratory Data Analysis and Feature Engineering

ID Name MinDiameter MaxDiameter RelativeVelocity MissDistance OrbitingBody SentryObject AbsoluteMagnitude Hazardous
2162635 162635 (2000 SS164) 1.198 2.679 13569.25 54839744 Earth False 16.73 False
2277475 277475 (2005 WK4) 0.266 0.594 73588.73 61438127 Earth False 20.00 True
2512244 512244 (2015 YE18) 0.722 1.615 114258.69 49798725 Earth False 17.83 False
3596030 (2012 BV13) 0.097 0.216 24764.30 25434973 Earth False 22.20 False
3667127 (2014 GE35) 0.255 0.570 42737.73 46275567 Earth False 20.09 True

The dataset is composed of 10 variables:

-ID: the ID number of each asteroid
-Name: the name of the asteroid
-MinDiameter: the minimum estimated diameter of the asteroid (in km)
-MaxDiameter: the maximum estimated diameter of the asteroid (in km)
-RelativeVelocity: the velocity relative to the Earth (in km/h)
-MissDistance: the distance between the Earth and the asteroid (in km)
-OrbitingBody: the planet that the asteroid orbits
-SentryObject: binary variable indicating whether the asteroid is included in an automated collision monitoring system
-AbsoluteMagnitude: intrinsic luminosity of the asteroid (in vmag). The lower the magnitude, the brighter the asteroid
-Hazardous: Boolean feature that shows whether the asteroid is harmful or not. It is the output variable, and the analysis is based on its prediction.

Data cleaning

Of course, before analyzing the data, I need to clean them in order to remove useless information.

-1. ID and Name carry no predictive information, so I will remove them. OrbitingBody and SentryObject have the same value for every row, so I can remove them as well.
-2. The output Hazardous is a binary variable (True/False), so I transform it into \(0\) and \(1\).

After verifying that OrbitingBody is \(100\%\) 'Earth' and that SentryObject is always False, I dropped these columns.
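As a sketch, the cleaning steps above can be written in base R (the toy data frame below is only for illustration; the real data come from the Kaggle file):

```r
# Minimal sketch of the cleaning steps, assuming the raw data frame
# has the columns described above (toy rows used for illustration).
data <- data.frame(
  ID = c(2162635, 2277475),
  Name = c("162635 (2000 SS164)", "277475 (2005 WK4)"),
  OrbitingBody = c("Earth", "Earth"),
  SentryObject = c(FALSE, FALSE),
  Hazardous = c(FALSE, TRUE)
)

# Drop the identifiers and the constant columns
data <- data[, !(names(data) %in% c("ID", "Name", "OrbitingBody", "SentryObject"))]

# Encode the boolean output as 0/1
data$Hazardous <- as.numeric(data$Hazardous)
```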

Data Visualization and Exploration

One of the most important steps, after understanding the features meaning and a first data cleaning, is the Data Visualization.
The first thing I want to observe is the distribution of hazardous and non-hazardous asteroids over the labels.

We can observe that only approximately one asteroid in ten is potentially hazardous.
So now we can move on to the observation of these asteroids.
The first purpose of the visualization is to compare the features of the dangerous objects with those of the non-dangerous ones.

As shown in the plot above, the risky asteroids are more likely to be “fast”: the faster an asteroid, the higher its potential risk.

Looking at the maximum diameter of the asteroids, we can notice that the risky ones tend to have a larger diameter, which implies higher danger. The curves have been smoothed with splines.

The magnitude seems to be a very significant variable for the analysis: the hazardous asteroids have a markedly smaller magnitude.
It means that they are brighter than the non-risky ones, and since brightness is tied to size, this signals danger.

Looking at the boxplot of the distances of the objects, we can notice that the maximum distance is the same for both groups: this is because we are selecting asteroids that pass near the Earth, so none is farther than approximately 75 million kilometers.
Looking at the overall distribution and the minimum distance, instead, the hazardous ones are clearly closer to the Earth.

From the correlation matrix we can observe that MinDiameter and MaxDiameter have the maximum possible correlation. This is because they carry the same information, and keeping both may be useless for the analysis.
This part will be reviewed in the feature engineering.

AbsoluteMagnitude is negatively correlated with the diameter: the larger an object, the brighter it tends to be, and hence the lower its magnitude.

Feature Engineering

Incorporation of redundant variables: MinDiameter and MaxDiameter represent the same quantity (the diameter of the asteroid). Keeping both would introduce redundancy and useless computation.
Given that these two measures are estimates, I merge them into a single variable: HypotheticalDiameter.

Standardization of the variables: the variables live on very different scales, so I standardize them in order to improve the efficiency of the algorithms. This step is performed after the data visualization.

The two main reasons why it's important to standardize these variables are:
-standardization improves the efficiency of the subsequent algorithms;
-the variables have different units of measure.
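A minimal base-R sketch of both steps (taking HypotheticalDiameter as the midpoint of the two estimates is an assumption of this sketch; the toy values are for illustration only):

```r
# Toy data with the two diameter estimates and one other numeric feature
df <- data.frame(
  MinDiameter = c(1.198, 0.266, 0.722),
  MaxDiameter = c(2.679, 0.594, 1.615),
  RelativeVelocity = c(13569.25, 73588.73, 114258.69)
)

# Merge the two hypothetical diameters into a single variable
# (midpoint of the two estimates -- an assumption for this sketch)
df$HypotheticalDiameter <- (df$MinDiameter + df$MaxDiameter) / 2
df$MinDiameter <- df$MaxDiameter <- NULL

# Standardize: zero mean and unit variance for every column
df <- as.data.frame(scale(df))
```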

The final Dataset

RelativeVelocity MissDistance AbsoluteMagnitude HypotheticalDiameter Hazardous
-1.364 0.795 -2.349 3.587 0
1.009 1.090 -1.219 0.464 1
2.617 0.570 -1.969 1.992 0
-0.921 -0.520 -0.459 -0.104 0
-0.211 0.412 -1.188 0.427 1

The Inferential Analysis

The purpose of the analysis is to forecast whether an asteroid is potentially hazardous for our planet or not.

Before studying the distributions, the division into train and test sets is fundamental. I randomly split the dataset into an \(80\%\) train set and a \(20\%\) test set.
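A sketch of the random split in base R (the placeholder size stands in for the real number of rows):

```r
set.seed(123)
n <- 1000                               # placeholder size; the real dataset has ~90,000 rows
idx <- sample(seq_len(n), size = floor(0.8 * n))
train_idx <- idx                        # 80% of the rows
test_idx  <- setdiff(seq_len(n), idx)   # remaining 20%
```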

The proportions are approximately the same, so I can proceed with the analysis.

1. Probit regression

The best way to predict the hazard of an asteroid is to apply the binary classification algorithms: the first approach is the implementation of a probit model.
The model takes the following form:

\[P(Y=1|X)=\Phi(X^T\beta)\]

Where \(P\) denotes the probability and \(\Phi\) denotes the cumulative distribution function of the standard Normal distribution (\(N(0,1)\)), used as the link for the regression function. The parameters \(\beta\) have to be estimated.

It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable:

\[Y^*=X^T\beta + \epsilon\] where \(\epsilon \sim N(0,1)\), then \(Y\) can be viewed as an indicator for whether the latent variable is positive.

\[ Y = \begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{otherwise} \end{cases} \]
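The latent-variable view can be checked with a quick simulation: drawing \(Y^* = X^T\beta + \epsilon\) with standard normal noise and thresholding at zero reproduces \(P(Y=1|X)=\Phi(X^T\beta)\) (toy value of \(X^T\beta\), for illustration only):

```r
set.seed(42)
n   <- 200000
xb  <- 0.7                         # a fixed value of X^T beta (toy choice)
eps <- rnorm(n)                    # epsilon ~ N(0, 1)
ystar <- xb + eps                  # latent variable
y <- as.numeric(ystar > 0)         # observed indicator

# Empirical P(Y = 1) vs the probit probability pnorm(xb)
c(empirical = mean(y), theoretical = pnorm(xb))
```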

Implementation of the model with RJags

Gibbs sampling of a probit model is possible because regression models typically use normal prior distributions over the weights, and this distribution is conjugate with the normal distribution of the errors (and hence, although not exploited here, with the distribution of the latent variables \(Y^*\)). The model can be specified as

\[\beta_j \sim N(0, \tau = 0.0001)\]

for \(j = 0,\dots,4\), where \(\tau\) is the precision (the second argument of dnorm in JAGS), so this is a vague prior with variance \(10^4\), and

\[ Y_i \sim Ber(p_i), \quad probit(p_i) = X_i^T\beta \]

N = length(train$Hazardous)
probit_reg <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    probit(p[i]) =  beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
  }
  
  #Prior beta parameters
  beta0 ~ dnorm(0, 0.0001)
  beta1 ~ dnorm(0, 0.0001)
  beta2 ~ dnorm(0, 0.0001)
  beta3 ~ dnorm(0, 0.0001)
  beta4 ~ dnorm(0, 0.0001)
}
set.seed(123)
model1 = jags(data = data_sim,
             model.file = probit_reg,
             parameters.to.save = params,            
             n.chains = 2,
             n.iter = 10000, 
             n.burnin = 1000, 
             n.thin = 10) 
μ.vect σ.vect Quantile_0.25 Quantile_0.50 Quantile_0.75 R.hat N_eff
β_0 -2.148 0.043 -2.163 -2.150 -2.138 1.001 1800
β_1 0.116 0.008 0.111 0.116 0.121 1.001 1800
β_2 -0.234 0.011 -0.241 -0.234 -0.227 1.004 450
β_3 -1.626 0.052 -1.645 -1.629 -1.612 1.001 1800
β_4 -0.638 0.026 -0.650 -0.640 -0.629 1.001 1800
Deviance 32127.261 319.914 32113.733 32115.356 32117.769 1.000 1800

DIC = 83328.3 \(p_D\) = 51201.0

The values shown in “\(\mu.vect\)” are the posterior mean approximations of the parameters produced by the MCMC.

Instead, “\(\sigma.vect\)” represents the standard deviation and can be interpreted as the posterior uncertainty of the parameters: \(\beta_3\) in fact has the highest uncertainty, while \(\beta_1\) has the lowest.

The statistic \(\hat R\) tells me whether the parameters have converged: the closer \(\hat R\) is to 1, the more plausible the convergence.

The DIC (Deviance Information Criterion) is an asymptotic approximation, valid as the sample size becomes large (it is the Bayesian generalization of the frequentist AIC). It is defined as

\[DIC = D_{\hat \theta}(y)+2p_D = \hat D_{avg}(y)+p_D = 2\hat D_{avg}(y)-D_{\hat \theta}(y)\] where
\(D_{\hat\theta}(y)=-2\log f(y|\hat\theta(y))\) is the deviance at the posterior point estimate,
\(\hat D_{avg}(y) \approx \frac{1}{M}\sum_{j=1}^{M} -2\log f(y|\theta^{(j)})\) is the posterior average deviance,
and \(p_D = \hat D_{avg}(y)-D_{\hat\theta}(y)\) is the effective number of parameters.

The minimum DIC estimates the “best” model in the same spirit as AIC.
The DIC is a comparative index, so I will compare it with all the models in order to choose the best one.
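As a sketch of how the pieces above fit together, the DIC can be computed directly from posterior deviance samples and the deviance at the posterior mean (a toy Bernoulli example; in the analysis these quantities come from the JAGS output):

```r
set.seed(1)
y <- rbinom(50, 1, 0.3)                               # toy data
p_draws <- rbeta(2000, 1 + sum(y), 1 + 50 - sum(y))   # posterior draws of p

dev <- function(p) -2 * sum(dbinom(y, 1, p, log = TRUE))
D_avg <- mean(sapply(p_draws, dev))     # \hat D_avg(y): posterior average deviance
D_hat <- dev(mean(p_draws))             # D_{\hat\theta}(y): deviance at posterior mean
p_D   <- D_avg - D_hat                  # effective number of parameters
DIC   <- D_hat + 2 * p_D                # equals D_avg + p_D = 2*D_avg - D_hat
```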

Trace-plots

Let’s observe now the stationary regions

The trace plot shows the path followed by each parameter across the iterations of the Markov chains.
From the plot we can see that the processes look stationary: the trend of each parameter settles around a constant level as the iterations grow.

Empirical Average

If \(E|h(\theta)|<\infty\) and \(\theta_1,\dots,\theta_T\) are iid, then by the Strong Law of Large Numbers (SLLN) the empirical average is a consistent estimator of \(I=E[h(\theta)]\): \[ \hat{I}_{T} = \frac{1}{T} \sum_{t=1}^{T} h(\theta_t)\xrightarrow{\text{a.s.}} I \] So I implement the estimator in the following way and apply it to every parameter.
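A base-R sketch of the running empirical average on a stand-in chain (in the analysis the draws come from the JAGS chains):

```r
set.seed(7)
theta <- rnorm(5000, mean = 2)            # stand-in for posterior draws of one parameter

# Running mean \hat I_t for t = 1, ..., T
running_mean <- cumsum(theta) / seq_along(theta)
tail(running_mean, 1)                     # the final empirical average \hat I_T
```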

Each parameter’s empirical average converges to the same point: even with different initial points for the two chains, the estimated posterior mean is approximately the same.

Density

From the density plot we can observe the distribution of the estimated parameters.

The mean of each distribution is the point estimate of the parameter (“\(\mu.vect\)”). The higher the curve at a given x, the more plausible that value is as the parameter. As we can observe, the distributions are mainly concentrated around the estimated parameters, but another way to quantify the uncertainty of the parameters is with the Credible Intervals.

There are two types of credible intervals: Equal-Tail and Highest Posterior Density (HPD).

If the distribution is not unimodal and symmetric, there could be points outside the ET interval having a higher posterior density than some points inside it. So in this case it's better to use the Highest Posterior Density interval (HPD).

The defining property of the HPD interval is that the posterior density of every point in the credible region \(I_{\alpha}\) is higher than the posterior density of any point outside it.
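For a vector of posterior draws, the empirical HPD interval is simply the shortest interval containing \(1-\alpha\) of the samples; a base-R sketch on toy draws (coda's HPDinterval implements the same idea):

```r
# Shortest interval containing prob of the sorted draws
hpd <- function(x, prob = 0.95) {
  x <- sort(x)
  n <- length(x)
  m <- floor(prob * n)                 # number of order-statistic gaps to span
  widths <- x[(m + 1):n] - x[1:(n - m)]
  i <- which.min(widths)               # shortest window covering ~prob of the mass
  c(lower = x[i], upper = x[i + m])
}

set.seed(3)
draws <- rnorm(10000)                  # stand-in for posterior draws
hpd(draws)
```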

The HPD intervals at level \(\alpha = 0.05\) are the following

lower upper
beta0 -2.2048209 -2.1278060
beta1 0.1027309 0.1314259
beta2 -0.2548592 -0.2171067
beta3 -1.7178587 -1.6179294
beta4 -0.7133528 -0.6489877
deviance 32236.4756045 32248.1679345

The intervals effectively contain the values of the estimated parameters.

Autocorrelations

Autocorrelation is the Pearson correlation between values of the process at different times, as a function of the two times or of the time lag.
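As a sketch, the lagged autocorrelations can be computed with base R's acf on a toy autocorrelated chain (an AR(1) series stands in for an MCMC chain here):

```r
set.seed(11)
# An AR(1) series as a stand-in for an autocorrelated MCMC chain
chain <- as.numeric(arima.sim(model = list(ar = 0.6), n = 5000))

a <- acf(chain, lag.max = 50, plot = FALSE)
a$acf[c(1, 11, 51)]    # autocorrelation at lags 0, 10 and 50
```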

beta0 beta1 beta2 beta3 beta4 deviance
Lag 0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Lag 10 0.3612700 0.0001370 0.0623235 0.3980965 0.3595515 0.0438963
Lag 50 0.0332511 -0.0127798 0.0135061 0.0364965 0.0278903 0.0001390
Lag 100 -0.0152833 0.0222933 -0.0405888 -0.0126048 -0.0118249 0.0002038
Lag 500 0.0117260 -0.0211545 0.0179715 0.0066646 0.0034373 0.0000932

From the ACF plots and the table we observe that, as the lag increases, the autocorrelation decreases towards zero. This means there is no appreciable correlation between distant values of the chain.
We can also observe that the autocorrelation of \(\beta_1\) and \(\beta_2\) decreases faster than that of the other parameters.

Correlation between parameters

Above we've seen the behavior of the Markov chains from different points of view.
Looking closely at the graphs, we can notice that \(\beta_0\), \(\beta_3\) and \(\beta_4\) follow very similar patterns and have similar distributions.
This suggests looking at the correlation between them to check whether it is high:

As expected, there is high correlation between them!

Approximation Error

To evaluate the approximation error I use the MCSE estimate: the MCSE (Monte Carlo Standard Error) is an estimate of the inaccuracy of Monte Carlo samples, usually regarding the expectation of posterior samples from Markov chain Monte Carlo algorithms.
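A common way to estimate the MCSE is the batch-means method; a base-R sketch on a stand-in chain (LaplacesDemon's MCSE function offers this among other estimators):

```r
set.seed(5)
chain <- rnorm(10000)                  # stand-in for posterior draws of one parameter

# Batch-means MCSE: split the chain into batches, take the sd of the batch means
mcse_batch <- function(x, n_batches = 50) {
  batch_size <- floor(length(x) / n_batches)
  x <- x[1:(batch_size * n_batches)]
  means <- colMeans(matrix(x, nrow = batch_size))
  sd(means) / sqrt(n_batches)          # standard error of the overall mean
}

mcse_batch(chain)
```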

I estimated the MCSE with the \(MCSE\) function from the LaplacesDemon package, obtaining the following results:

Approximation_Error
β_0 0.0030455
β_1 0.0002415
β_2 0.0004255
β_3 0.0037533
β_4 0.0018643

The highest approximation error is the \(\beta_3\)’s, and the lowest is the \(\beta_1\)’s.

Predictions

Now that the parameters have been estimated, it’s time to do the first forecast on the test set:

#Parameters
beta0 = model1$BUGSoutput$summary["beta0", "mean"]
beta1 = model1$BUGSoutput$summary["beta1", "mean"]
beta2 = model1$BUGSoutput$summary["beta2", "mean"]
beta3 = model1$BUGSoutput$summary["beta3", "mean"]
beta4 = model1$BUGSoutput$summary["beta4", "mean"]

x = test[1:4]
y = test[,5]
XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4]

#predictions for a given threshold t
predictions_t = function(xTb, t){
  n = length(xTb[, 1])
  predictions = rep(NA, n)
  for(i in 1:n){
    # probit inverse link: posterior probability of being hazardous
    p = pnorm(xTb[i, 1])
    predictions[i] = ifelse(p > t, 1, 0)
  }
  return(predictions)
}

Before selecting the best threshold, it’s good to observe the ROC curves for the different thresholds in order to choose the one with the best performance for the problem.

In fact, the main objective of the task is to discover as many hazardous asteroids as possible, increasing the True Positive Rate (TPR/Recall) and the Precision and avoiding false negatives (decreasing the FNR), even at the cost of accepting some more false positives (higher FPR).

It’s better to know which objects are going to collide with the Earth, than knowing which are not dangerous!
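This trade-off can be made concrete by computing the confusion-matrix indices over a grid of thresholds; a sketch with toy probabilities and labels (in the analysis, the probabilities and labels come from the model and the test set):

```r
# Toy predicted probabilities and true labels (for illustration only)
probs <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.1)
y     <- c(1,   1,   1,   0,   0,   0)

metrics_at <- function(probs, y, t) {
  pred <- as.numeric(probs > t)
  tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1); tn <- sum(pred == 0 & y == 0)
  c(threshold = t,
    accuracy  = (tp + tn) / length(y),
    precision = tp / max(tp + fp, 1),
    recall    = tp / max(tp + fn, 1))
}

# Indices for a few candidate thresholds
t(sapply(c(0.25, 0.35, 0.5), function(t) metrics_at(probs, y, t)))
```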

The best threshold for this task is \(t=0.35\).

In fact it gives us the following values:

  • Accuracy: \(87.84\%\)
  • Precision: \(94.18\%\)
  • Recall: \(92.49\%\)
  • AUROC: \(61.44\%\)
  • F1-Score: \(93.33\%\)

The confusion matrix for the threshold is the following

Testing

1. Geweke Diagnostic is based on a test for equality of the means of the first and last parts of a Markov chain (by default the first 10% and the last 50%).

The null hypothesis is that the two parts of the chain come from the same distribution, and it is checked with a mean-difference test.
If the two averages are equal, it is likely that the chain has converged.

The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error adjusted for autocorrelation.
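A simplified version of this statistic in base R (this sketch uses naive standard errors, ignoring the autocorrelation/spectral-density adjustment that coda's geweke.diag applies):

```r
geweke_z <- function(chain, frac1 = 0.1, frac2 = 0.5) {
  n <- length(chain)
  a <- chain[1:floor(frac1 * n)]            # first 10% of the chain
  b <- chain[(n - floor(frac2 * n) + 1):n]  # last 50% of the chain
  # Naive standard errors (coda adjusts these for autocorrelation)
  (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
}

set.seed(9)
geweke_z(rnorm(2000))   # for a stationary chain, roughly N(0, 1)
```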

Z.score_Chain1 Z.score_Chain2
β_0 1.377 0.9585
β_1 -1.303 -1.9106
β_2 1.315 0.8069
β_3 1.325 0.9575
β_4 1.412 1.0057
deviance 1.061 1.0609

The plot shows that almost all the evaluated Z-scores fall within the acceptance region, so we accept the null hypothesis.

2. Raftery & Lewis Diagnostic estimates the number of iterations needed for a given level of precision in posterior samples.

What we want to measure is some posterior quantile of interest q.

If we define some acceptable tolerance \(r\) for \(q\) and a probability \(s\) of being within that tolerance, the Raftery and Lewis diagnostic will calculate the number of iterations \(N = T_{min}\) and the number of burn-ins \(M\) necessary to satisfy the specified conditions.

The results are:
Quantile = 0.025
Accuracy = +/- 0.005
Probability of Accuracy = 0.95

3746 samples are needed for the values above.

3. Heidelberger & Welch Diagnostic is a convergence test that checks whether the sampled values come from a stationary distribution.

It is applied first to the whole chain; if the null hypothesis is rejected, the first \(10\%\) of the chain is discarded and the test repeated, until the null hypothesis is accepted or \(50\%\) of the chain has been discarded. In the latter case, the chain fails the stationarity test.

Stationarity_test_chain1 Start_iteration_chain1 p.value_chain1 Stationarity_test_chain2 Start_test_chain2 p.value_chain2
β_0 passed 91 0.454 passed 91 0.1845
β_1 passed 1 0.733 passed 1 0.1113
β_2 passed 1 0.203 passed 1 0.3172
β_3 passed 91 0.233 passed 181 0.0939
β_4 passed 91 0.207 passed 91 0.8046
deviance passed 1 0.577 passed 1 0.6052
Halfwidth_test_chain1 Mean_chain1 Halfwidth_chain1 Halfwidth_test_chain2 Mean_chain2 Halfwidth_chain2
β_0 passed -2.167 2.68e-03 passed -2.169 2.78e-03
β_1 passed 0.117 4.94e-04 passed 0.117 4.96e-04
β_2 passed -0.235 7.69e-04 passed -0.236 8.05e-04
β_3 passed -1.672 3.61e-03 passed -1.675 3.71e-03
β_4 passed -0.682 1.98e-03 passed -0.684 1.94e-03
deviance passed 32252.650 2.10e+01 passed 32252.524 2.10e+01

Every test has been passed successfully.

Comparison with the Frequentist approach

After observing the behaviour and the goodness of the parameters, it's time to compare the obtained model with the parameters estimated by the frequentist approach.
In order to do that, I use the \(glm\) function from R.

freq_probit = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "probit"),data=train)

AIC = 32250

The estimated parameters are pretty close to the parameters studied in the Bayesian approach.

In order to compare the models, I use the same techniques used above (ROC curve + evaluation of the indices from the confusion matrix)

The best threshold is \(t=0.2\) that gives me the following results:

  • Accuracy: \(87.57\%\)
  • Precision: \(93.50\%\)
  • Recall: \(92.79\%\)
  • AUROC: \(62.86\%\)
  • F1-Score: \(93.14\%\)

The confusion matrix for the threshold is the following

2. Logistic Regression (Logit)

Another way to model binary outputs is the Logistic Regression with the logit link function. The assumptions are the following:

\[Y_i \sim Ber(p_i)\] \[logit(p_i)=\log\bigg(\frac{p_i}{1-p_i}\bigg)=\beta_0 + \beta_1x_{1_i} + \beta_2x_{2_i}+\beta_3x_{3_i} + \beta_4x_{4_i}\]

Also in this case the parameters \(\beta\) have a vague Normal prior with mean \(\mu = 0\) and precision \(\tau = 0.0001\) (i.e. variance \(10^4\))

\[\beta_i\sim N(0,\tau=0.0001)\]

for \(i=0,1,2,3,4\).

Implementation of the model with RJags

As done before, the model is implemented with RJags

N = length(train$Hazardous)
log_reg <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) =  beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
  }
  
  #Prior beta parameters
  beta0 ~ dnorm(0, 0.0001)
  beta1 ~ dnorm(0, 0.0001)
  beta2 ~ dnorm(0, 0.0001)
  beta3 ~ dnorm(0, 0.0001)
  beta4 ~ dnorm(0, 0.0001)
}
set.seed(123)
model2 = jags(data = data_sim,
             model.file = log_reg,
             parameters.to.save = params,            
             n.chains = 2,
             n.iter = 10000, 
             n.burnin = 1000, 
             n.thin = 10) 
μ.vect σ.vect Quantile_0.25 Quantile_0.50 Quantile_0.75 R.hat N_eff
β_0 -3.980 0.102 -4.012 -3.984 -3.959 1.000 1800
β_1 0.198 0.013 0.189 0.199 0.208 1.002 1800
β_2 -0.455 0.021 -0.467 -0.456 -0.444 1.001 1800
β_3 -3.315 0.123 -3.356 -3.321 -3.287 1.000 1800
β_4 -1.528 0.073 -1.560 -1.531 -1.503 1.001 1800
Deviance 32486.199 414.230 32468.279 32469.978 32472.291 1.000 1800

DIC = 118166.2

Also in this case the parameters converge, but they are larger in absolute value than the probit ones.
\(\beta_3\) and \(\beta_0\) have the highest posterior uncertainty.

Trace-plot

The behavior of the parameters in the simulation is quite similar to the probit case; the only difference is the mean of the parameters (the level around which the chains oscillate).

Anyway, all the processes look stationary.

Empirical Average

The empirical averages follow a pattern very similar to that of the probit parameters, with the difference that \(\beta_1\) has a cleaner behavior and converges faster to \(I\).

Density

For the same reason as before, I prefer to use the HPD interval instead of the Equal-Tail one.

lower upper
beta0 -4.0993935 -3.9340667
beta1 0.1756789 0.2270450
beta2 -0.4898944 -0.4209132
beta3 -3.4782841 -3.2677423
beta4 -1.6909881 -1.5244092
deviance 32494.8289698 32506.8282006

Autocorrelations

There is a difference between the probit parameters and the logit ones: \(\beta_0\), \(\beta_3\) and \(\beta_4\) have a slower convergence and a higher autocorrelation.

beta0 beta1 beta2 beta3 beta4 deviance
Lag 0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Lag 10 0.5155044 0.0103044 0.1591531 0.5766383 0.5811649 0.0936733
Lag 50 0.1216148 0.0073981 0.0264474 0.1434062 0.1595974 0.0031139
Lag 100 0.0117419 -0.0230559 -0.0013087 0.0180058 0.0281247 -0.0002063
Lag 500 0.0109224 -0.0083524 0.0005585 0.0103798 0.0069378 -0.0003884

Correlation between parameters

The correlation of \(\beta_0\) and \(\beta_3\) with the other parameters is higher than that of \(\beta_0\) and \(\beta_3\) in the probit model.

Approximation Error

The MCSEs for the logistic regression are the following.

Approximation_Error
β_0 0.0099490
β_1 0.0004612
β_2 0.0010909
β_3 0.0127902
β_4 0.0078298

Comparing the MCSEs is important because it measures the reliability of the parameter estimates across models. In this case, all the parameters show a lower approximation error in the probit model than in the logistic regression.

Predictions

Now it’s time to evaluate the model from the point of view of the forecasting.

#Parameters
beta0 = model2$BUGSoutput$summary["beta0", "mean"]
beta1 = model2$BUGSoutput$summary["beta1", "mean"]
beta2 = model2$BUGSoutput$summary["beta2", "mean"]
beta3 = model2$BUGSoutput$summary["beta3", "mean"]
beta4 = model2$BUGSoutput$summary["beta4", "mean"]

x = (test[1:4])
y = test[,5]
#eta
XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4] 

predictions_t = function(xTb, t){
  n = length(xTb[, 1])
  predictions = rep(NA, n)
  for(i in 1:n){
    # logit inverse link (logistic function)
    p = 1/(1 + exp(-xTb[i, 1]))
    predictions[i] = ifelse(p > t, 1, 0)
  }
  return(predictions)
}

By looking at the ROC curves we can now choose the best threshold for the task:

The best threshold for this task is \(t=0.25\).

In fact it gives us the following values:

  • Accuracy: \(84.78\%\)
  • Precision: \(88.18\%\)
  • Recall: \(94.59\%\)
  • AUROC: \(70.58\%\)
  • F1-Score: \(91.27\%\)

Comparison with the frequentist approach

freq_logit = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "logit"),data=train)

AIC = 32500

Also in this case the best threshold is \(t=0.25\), giving the following values:

  • Accuracy: \(86.25\%\)
  • Precision: \(91.06\%\)
  • Recall: \(93.53\%\)
  • AUROC: \(66.19\%\)
  • F1-Score: \(92.28\%\)

3. Complementary log-log regression

Unlike the logit and probit models, the complementary log-log (cloglog) model is asymmetric, and it is frequently used when the probability of an event is very small or very large.
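The asymmetry can be seen by evaluating the three inverse links at \(\eta = 0\): probit and logit both give \(p = 0.5\), while the inverse cloglog, \(p = 1 - e^{-e^{\eta}}\), does not (a quick base-R check):

```r
inv_probit  <- function(eta) pnorm(eta)            # probit inverse link
inv_logit   <- function(eta) 1 / (1 + exp(-eta))   # logit inverse link
inv_cloglog <- function(eta) 1 - exp(-exp(eta))    # cloglog inverse link

eta <- 0
c(probit  = inv_probit(eta),       # 0.5: symmetric around eta = 0
  logit   = inv_logit(eta),        # 0.5: symmetric around eta = 0
  cloglog = inv_cloglog(eta))      # about 0.632: asymmetric
```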

The assumptions in this case are

\[ Y_i \sim Ber(p_i) \]

\[ cloglog(p_i) = \log(-\log(1-p_i)) = \beta_0 + \beta_1x_{1_i} + \beta_2x_{2_i} +\beta_3x_{3_i} +\beta_4 x_{4_i} \]

And the assumption for the prior distribution of \(\beta\) is

\[ \beta_i \sim N(0,0.0001) \]

for \(i=0,1,2,3,4\).

Implementation of the model with RJags

N = length(train$Hazardous)
cloglog <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    cloglog(p[i]) <-  beta0 + beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i]
  }
  
  beta0 ~ dnorm(0, 0.0001)
  beta1 ~ dnorm(0, 0.0001)
  beta2 ~ dnorm(0, 0.0001)
  beta3 ~ dnorm(0, 0.0001)
  beta4 ~ dnorm(0, 0.0001)
  
}
set.seed(123)
model3 = jags(data = data_sim,
              model.file = cloglog,
              parameters.to.save = params,            
              n.chains = 2,
              n.iter = 10000, 
              n.burnin = 1000, 
              n.thin = 10) 
μ.vect σ.vect Quantile_0.25 Quantile_0.50 Quantile_0.75 R.hat N_eff
β_0 -3.920 0.036 -3.943 -3.920 -3.896 1.000 1800
β_1 0.158 0.011 0.150 0.158 0.165 1.002 810
β_2 -0.404 0.015 -0.414 -0.405 -0.395 1.002 1800
β_3 -3.090 0.046 -3.120 -3.090 -3.059 1.001 1800
β_4 -1.540 0.038 -1.565 -1.540 -1.515 1.001 1800
Deviance 32693.754 3.075 32691.502 32693.086 32695.484 1.000 1

DIC = 32698.5

As far as posterior uncertainty is concerned, it is quite similar to the previous models, but the interesting behavior of this chain regards the diagnostics:

Trace-plot

From the plots, the chains look stationary.

Empirical Average

The moving-mean plots are less smooth than the previous ones; in fact, convergence arrives later, especially for the second chain (the dark-green one).

Density

Even if the distribution may look somewhat symmetrical, it is not.

So I evaluate the HPD intervals as done before:

lower upper
beta0 -3.9981032 -3.8552132
beta1 0.1356655 0.1781634
beta2 -0.4303237 -0.3777290
beta3 -3.1740849 -2.9889132
beta4 -1.6182374 -1.4711771
deviance 32689.0941526 32699.0877019

Autocorrelations

beta0 beta1 beta2 beta3 beta4 deviance
Lag 0 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Lag 10 0.5945371 0.0123484 0.0192054 0.7165430 0.5516016 0.0889108
Lag 50 0.1612906 -0.0195376 0.0092633 0.1670784 0.1241136 0.0025466
Lag 100 0.0148105 0.0266597 0.0005488 0.0118273 0.0022424 0.0123025
Lag 500 -0.0123488 0.0017639 -0.0158581 0.0014514 0.0277657 -0.0230362

The autocorrelation plots are quite similar to the logit ones: the correlation between values of the chain at different lags decays in a similar way.

Correlation between parameters

In this case the parameters are less correlated: \(\beta_0\) is less correlated with \(\beta_2\) and with the deviance, and the overall correlation is lower than before for every parameter.

Approximation Error

In this case the MCSEs are higher than the probit ones, but lower than the logit ones.

Approximation_Error
β_0 0.0035653
β_1 0.0004490
β_2 0.0005642
β_3 0.0048462
β_4 0.0035437

\(\beta_3\) has the biggest approximation error and \(\beta_1\) has the smallest.

Predictions

XTb = beta0 + beta1*x[1] + beta2*x[2] + beta3*x[3] + beta4*x[4]

predictions_t = function(xTb, t){
  n = length(xTb[, 1])
  predictions = rep(NA, n)
  for(i in 1:n){
    # cloglog inverse link
    p = 1 - exp(-exp(xTb[i, 1]))
    predictions[i] = ifelse(p > t, 1, 0)
  }
  return(predictions)
}

For the best threshold \(t=0.3\), the values obtained are

  • Accuracy: \(85.16\%\)
  • Precision: \(88.99\%\)
  • Recall: \(94.24\%\)
  • AUROC: \(69.17\%\)
  • F1-Score: \(91.54\%\)

Comparison with frequentist approach

The final comparison is with the frequentist cloglog regression.

train$Hazardous = as.numeric(train$Hazardous)
freq_cloglog = glm(Hazardous ~ HypotheticalDiameter + RelativeVelocity + MissDistance + AbsoluteMagnitude, family = binomial(link = "cloglog"),data=train)

AIC = 32700

For the threshold \(t=0.3\) we have:

  • Accuracy: \(86.85\%\)
  • Precision: \(92.10\%\)
  • Recall: \(93.25\%\)
  • AUROC: \(64.95\%\)
  • F1-Score: \(92.67\%\)

The best model

It's time to choose the best model among the six analyzed. The most important thing to know from this analysis is whether an asteroid is potentially harmful for our planet.
To complete this task, the best model for this analysis is the one with the highest precision (the percentage of true positives over the predicted positives) and the highest recall (the percentage of true positives over all the actual positives).
It's better to discover as many true positives as possible, sacrificing some false positives, than to miss asteroids that are very dangerous!

Thanks to the frequentist approach, I discovered that every parameter is very significant (***) for the model.

Now let’s look at the models with the optimal thresholds in order to choose the best one for this problem

BestThreshold Accuracy Precision Recall AUROC F1_score
Probit Bayesian 0.35 0.878 0.941 0.924 0.614 0.933
Probit Frequentist 0.25 0.875 0.935 0.927 0.628 0.931
Logit Bayesian 0.25 0.847 0.881 0.945 0.705 0.912
Logit Frequentist 0.25 0.862 0.910 0.935 0.661 0.922
Cloglog Bayesian 0.30 0.851 0.889 0.942 0.691 0.915
Cloglog Frequentist 0.30 0.868 0.921 0.932 0.649 0.926

Given the optimal thresholds for our problem, the best model for the prediction is the Bayesian Probit model.
In fact this model guarantees high performance in detecting true positives, avoiding false negatives. The model has the highest overall Accuracy and Precision, so it’s the best one.

Bayesian_DIC Frequentist_AIC
Probit 83328.3 32250
Logit 118166.2 32500
Cloglog 32698.5 32700

DIC and AIC are information criteria: the lower the criterion, the better the model should be.
In this case the best Bayesian model would be the Cloglog, and the best frequentist one the Probit.

Even if these information criteria suggest choosing a different model, the predictive performance suggests choosing the Bayesian Probit model for this problem.

Conclusions

In conclusion, all the models fit the data well.
They differ very slightly, but for this kind of analysis the best model is the probit one.

References

NASA - Nearest Object: https://www.kaggle.com/datasets/sameepvani/nasa-nearest-earth-objects

Gibbs Sampling for the Probit Regression Model: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.268&rep=rep1&type=pdf